Cross subkey side channel analysis based on small samples

Most recently demonstrated deep-learning side-channel analysis (DLSCA) methods use neural networks trained on a segment of traces containing only the operations related to the target subkey. However, when the training set is small, as in this paper with only 5K power traces, the deep-learning (DL) model cannot effectively learn the internal features of the data. In this paper, we propose a cross-subkey training approach that acts as a trace augmentation. We train deep-learning models not only on the trace segment containing the SBox operation of the target subkey of AES-128 but also on the segments for the other 15 subkeys. Experimental results show that the accuracy of the model trained on the subkey combination is 28.20% higher than that of the model trained on an individual subkey, on traces captured from an AES-128 implementation on an STM32F3 microcontroller. The approach is also validated on two additional publicly available datasets. At the same time, the number of traces that must be captured for training is greatly reduced, demonstrating the effectiveness and practicality of the method.

www.nature.com/scientificreports/
(1) the effectiveness of cross-subkey training is verified by varying the proportion of target and non-target subkeys in the training sets; (2) adding traces of non-target subkeys to the training set of the target subkey expands the training set and can improve the efficiency of the side-channel analysis twofold. The method is validated on the home-made dataset AES_STM32 and on the publicly available datasets AES_XMEGA and AES_GPU.

Background
This section first reviews AES-128. Afterwards, we briefly introduce deep learning and how to apply it to side-channel analysis. For a broader introduction to deep learning, see 7 . Finally, the three evaluation metrics used in this paper are presented.
AES-128. AES 1 is one of the most widely used symmetric cryptographic algorithms, standardized by NIST in FIPS 197 and included in ISO/IEC 18033-3. AES-128 is a subset of AES which takes a 128-bit key K to encrypt a 128-bit block of plaintext P and outputs a 128-bit block of ciphertext C. AES-128 contains 10 encryption rounds in total. Every round except the last repeats 4 steps sequentially: SubBytes, ShiftRows, MixColumns and AddRoundKey. The final round does not contain MixColumns. In our experiment, the mode of operation is set to Electronic Codebook (ECB), which first divides the message into blocks and then encrypts each block separately. The SubBytes procedure is a non-linear substitution which maps an 8-bit input to an 8-bit output by using the Substitution Box (SBox). An attack point for side-channel analysis is a selected intermediate state which can be used to describe the power consumed by the victim device during the execution of AES. The selection of the attack point is affected by the known input data (e.g. plaintext, ciphertext) and the physical measurements (e.g. power consumption, EM emissions, timing). Two common attack points are the SBox output of the first round and the SBox input of the last round of the AES algorithm. An appropriate attack point leads to a more efficient attack.
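To make the first-round attack point concrete, the following minimal sketch computes the label SBox(p XOR k) for one plaintext byte and one subkey. Rather than hard-coding the 256-entry table, it derives the SBox from its algebraic definition in FIPS 197 (multiplicative inverse in GF(2^8) followed by the affine map). The function names `gf_mul`, `build_sbox` and `sbox_label` are illustrative and not taken from the paper.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8) modulo the AES polynomial x^8+x^4+x^3+x+1."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B  # reduce by the AES polynomial
        b >>= 1
    return result


def build_sbox() -> list:
    """Build the AES SBox: multiplicative inverse followed by the affine map."""
    inverse = [0] * 256  # 0 has no inverse and is mapped to 0
    for x in range(1, 256):
        inverse[x] = next(y for y in range(1, 256) if gf_mul(x, y) == 1)
    sbox = []
    for x in range(256):
        b = inverse[x]
        s = b
        for shift in range(1, 5):  # XOR with the four left rotations of b
            s ^= ((b << shift) | (b >> (8 - shift))) & 0xFF
        sbox.append(s ^ 0x63)
    return sbox


SBOX = build_sbox()


def sbox_label(plaintext_byte: int, subkey: int) -> int:
    """First-round SBox output used as the class label of a sub-trace."""
    return SBOX[plaintext_byte ^ subkey]


# First plaintext and key bytes of the FIPS 197 Appendix B example.
label = sbox_label(0x32, 0x2B)  # 0xD4
```

Because the label depends only on the plaintext byte and the subkey, the same function labels the sub-traces of all 16 subkeys.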
Deep-learning side-channel attack. Deep learning is a subset of machine learning 16 that uses deep neural networks to learn from experience and to understand the input data in terms of a hierarchy of concepts. Since deep-learning techniques are good at extracting features from raw data 7,17,18 , deep-learning-based SCA can be several orders of magnitude more effective than traditional cryptanalysis. A typical deep-learning side-channel attack can be divided into two stages.
At the profiling stage, the attacker uses the deep-learning model to learn a leakage profile from a large set of power traces $T = \{T_1, T_2, \ldots, T_m\}$ captured from the profiling device, where $m$ is the number of traces in the training set. Each trace $T_i$ is labeled by the data processed at the attack point, $l(T_i) \in L$, where $L = \{0, 1, \ldots, 255\}$; together with some known input (e.g. the plaintext or ciphertext), this label can be used to derive the subkey. The neural network can be viewed as a mapping $N : \mathbb{R}^{d} \to I^{|L|}$, where $d$ is the number of sample points in a trace, and its output is a score vector $S = N(T) \in I^{|L|}$. The element $s_j$ of $S$ represents the probability that $l(T) = j$.
At the attack stage, the attacker uses the trained deep-learning model to classify traces captured from the victim device and obtains the score vector $S$. The guess for the $i$th subkey is the value $j$ with the largest probability in $S$:
$$k_i = \arg\max_{0 \le j \le 255} s_j. \tag{1}$$
We use $k_i^*$ to denote the real subkey. If $k_i = k_i^*$, the subkey is recovered successfully. To quantify the classification error of the neural network, we use the cross-entropy 16 as the loss function, and the optimizer is set to RMSprop (Root Mean Square Propagation).
Evaluation metrics. Accuracy. Model accuracy is defined as the probability that a model classifies the traces of a testing set correctly. As one of the most commonly used evaluation metrics in machine learning, accuracy characterises a model's ability to classify data. An increase in accuracy indicates that the backpropagation algorithm's optimization of the weights and biases is gradually converging to the correct values, and that the model is gradually converging to the optimal model. The loss of a model characterises the degree of deviation between the model's predicted and actual values: the smaller the loss, the closer the prediction is to the actual value. The loss function used in this experiment is the categorical cross-entropy. The accuracy of the model is
$$\mathrm{acc}(X_{attack}) = \frac{|\{x_i \in X_{attack} \mid \hat{k}(x_i) = k^*\}|}{|X_{attack}|}, \tag{2}$$
where $X_{attack}$ denotes the testing set, $x_i$ denotes the $i$th power trace in that set, $\hat{k}(x_i)$ denotes the key guess calculated from $x_i$, and $k^*$ is the correct key. The numerator counts the traces for which the guessed key equals the correct key, so the accuracy is the ratio of the number of such traces to the total number of traces in the testing set.
PGE and key rank. When traces are noisy 19 , it might be difficult for the model to predict the key from a single trace. In that case, partial guessing entropy (PGE) becomes a more suitable evaluation criterion. PGE is the mean rank of the real subkey when all possible subkeys are sorted by their predicted probabilities. During the attack stage, we use the trained model to classify traces from the testing set and obtain the probabilities of the different keys for each trace. For a trace $x_i \in X_{attack}$, the obtained probability vector is denoted as
$$P_i = (p_{i,0}, p_{i,1}, \ldots, p_{i,255}),$$
where $p_{i,j}$ is the predicted probability of $k = j$ for trace $x_i$. The rank of the correct key in $P_i$ is the Key Rank. It is usually used as an evaluation criterion for datasets with a better signal-to-noise ratio: for such datasets the number of traces needed to recover the correct key is usually in the single digits, so the Key Rank provides a more intuitive evaluation of the results. The fewer traces the Key Rank needs, the better the model. Afterwards, we apply an element-wise multiplication over all $P_i$ to obtain a cumulative probability:
$$P = \prod_{i=1}^{m} P_i,$$
where $m$ is the number of traces used for classification and the product is taken element-wise. PGE can then be represented as the average rank of the real key $k^*$ when the key candidates are sorted by $P$.
Composition of power traces. Power-based side-channel analysis exploits the fact that the power consumed by the victim device during encryption may differ depending on the input data and on the operations performed. The most interesting parts of a power trace can therefore be described as a data-dependent component $P_{data}$ and an operation-dependent component $P_{op}$. In addition, repeating the same operation with the same input data on the same device consumes a different amount of power at every repetition because of the electronic noise component $P_{noise}$. Meanwhile, the switching activity of transistors that is independent of the input data generates a constant amount of power consumption, called the constant component $P_{const}$. Thus, each point of a power trace can be modeled as the sum of these components 3 :
$$P_{total} = P_{data} + P_{op} + P_{noise} + P_{const}. \tag{4}$$
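Returning to the evaluation metrics of this section, the sketch below shows one way they could be computed from per-trace probability vectors indexed by key candidate. It is a minimal illustration, not the authors' code. The function names are hypothetical, ranks are counted from 0, and the cumulative probability is accumulated as a sum of logarithms, which orders the candidates the same way as the element-wise product but avoids numerical underflow.

```python
import math


def accuracy(prob_rows, correct_keys):
    """Fraction of traces whose most probable candidate is the correct key."""
    hits = 0
    for row, correct in zip(prob_rows, correct_keys):
        guess = max(range(len(row)), key=row.__getitem__)
        hits += guess == correct
    return hits / len(prob_rows)


def rank_of(scores, correct):
    """Rank of the correct candidate when sorted by descending score (0 = best)."""
    return sum(1 for s in scores if s > scores[correct])


def cumulative_ranks(prob_rows, correct, eps=1e-40):
    """Rank of the correct key after combining 1, 2, ..., m traces.

    Log-probabilities are summed, which is equivalent to the element-wise
    product of the probability vectors; eps guards against log(0).
    """
    totals = [0.0] * len(prob_rows[0])
    ranks = []
    for row in prob_rows:
        for j, p in enumerate(row):
            totals[j] += math.log(p + eps)
        ranks.append(rank_of(totals, correct))
    return ranks


def pge(final_ranks):
    """Partial guessing entropy: mean final rank over repeated attacks."""
    return sum(final_ranks) / len(final_ranks)


# Toy example with 4 key candidates; the correct key is candidate 1.
rows = [
    [0.4, 0.3, 0.2, 0.1],
    [0.1, 0.6, 0.2, 0.1],
    [0.2, 0.5, 0.2, 0.1],
]
acc = accuracy(rows, [1, 1, 1])   # two of the three traces are classified correctly
ranks = cumulative_ranks(rows, 1)  # [1, 0, 0]
```

In the toy example the first trace alone ranks the correct key second, and the correct key moves to the top once the second trace is combined with it.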

Cross-subkey attack
Trace augmentation. Deep-learning techniques have performed remarkably well in many side-channel attack scenarios. However, deep-learning models that are inadequately trained cannot effectively learn the features within the data. Unfortunately, many attackers may not have access to large profiling datasets; for instance, an attacker may not have full control of the profiling device and may be able to capture only a limited number of traces. One data-level solution to the problem of limited training data is data augmentation 20 , which uses additional synthetically modified traces as a regularizer and helps improve the fit when training models for side-channel analysis.
In software implementations of AES, leakage is time-dependent since instructions are carried out one by one 21 . This leads to the generally accepted approach for attacking software implementations of AES, which is to build a leakage profile between the traces and the target subkey. On 8-bit microcontrollers and microprocessors, the encryption is typically implemented byte by byte. If the same data is processed by two SBox substitutions, the power traces of these two operations can be similar, since the data-dependent and operation-dependent components in Eq. (4) are the same. Figure 2 shows power traces captured from an 8-bit microcontroller implementation of AES, representing the first and the second SBox operations in the first round. One can see that the power traces look very similar when the same data is processed by the two SBox substitutions. We can therefore use a small number of traces related to the non-target subkeys as a regularizer for a training set that otherwise contains only traces for the target subkey. This is a data augmentation for a specific subkey that yields a model with a better fitting capacity.

Cross-subkey model training. As shown in Fig. 1, a trace that contains the 16 SBox computations of the first round is first divided into 16 sub-traces. The $i$th sub-trace is labeled by $l_i = \mathrm{SBox}(p_i \oplus k_i)$, the output of the $i$th SBox operation, where $p_i$ denotes the $i$th byte of the plaintext and $k_i$ the $i$th subkey.
At the profiling stage, traces are divided into 16 sub-traces by analyzing the points of interest (POI), and each sub-trace is labeled by the corresponding SBox output. Generally, to recover the $i$th subkey, attackers train deep-learning models on the sub-traces labeled by the $i$th SBox output. In cross-subkey training, we go one step further and add a small number of sub-traces representing the other 15 SBox operations to the training set.
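The segmentation step can be sketched as follows. This is an illustration under stated assumptions: only the window [28 : 68] for the first SBox operation and the trace length of 750 samples are reported in this paper (for AES_STM32); the evenly spaced start offsets used for the other 15 windows are placeholders, and the function name `split_trace` is hypothetical. In practice the offsets would come from the POI analysis.

```python
def split_trace(trace, starts, width):
    """Cut one trace into len(starts) sub-traces of `width` samples each."""
    sub_traces = []
    for start in starts:
        if start < 0 or start + width > len(trace):
            raise ValueError(f"window [{start}:{start + width}] lies outside the trace")
        sub_traces.append(trace[start:start + width])
    return sub_traces


# A dummy 750-sample trace whose sample values equal their index.
trace = list(range(750))

# Placeholder offsets: the first window matches [28:68]; the regular spacing
# of the remaining 15 windows is an assumption made for this example only.
starts = [28 + 40 * i for i in range(16)]
sub_traces = split_trace(trace, starts, width=40)
```

Each of the 16 sub-traces is then paired with the SBox output of the corresponding byte, so that sub-traces of different subkeys share one label space of 256 classes.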
We divided the experiment into two parts (note: the training set in this paper contains 5K traces):
• Verifying the validity of cross-subkey training (the total training set size is kept constant at 5K). We define the proportion of sub-traces of the target subkey in the total training set as $x/16$, with $x \in [1,16]$. The proportion of the other subkeys in the training set is thus $(16 - x)/16$, evenly distributed among the other 15 subkeys.
• Applying cross-subkey training (the total training set grows by 5K at a time). We use all the power traces of the target subkey (5K in this paper) for training and add, at each step, as many power traces as there are target traces (5K), provided by the other 15 subkeys. The training set size is thus $5\mathrm{K} \times y$ with $y \in [1,16]$, where the $5\mathrm{K} \times (y - 1)$ additional traces are provided in equal shares by the other 15 subkeys.
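The two training-set compositions described above can be expressed as per-subkey trace counts. The sketch below is an illustration only: the function names are hypothetical, and the handling of counts that do not divide evenly (integer division, with the remainder spread over the first non-target subkeys) is an assumption, since the text does not specify it.

```python
def _spread(counts, amount, target):
    """Distribute `amount` traces as evenly as possible over non-target subkeys."""
    others = [i for i in range(len(counts)) if i != target]
    share, extra = divmod(amount, len(others))
    for position, i in enumerate(others):
        counts[i] = share + (1 if position < extra else 0)
    return counts


def fixed_size_counts(total, x, target=0, n_subkeys=16):
    """First part: the total stays fixed and the target subkey gets x/16 of it."""
    if not 1 <= x <= n_subkeys:
        raise ValueError("x must lie in [1, n_subkeys]")
    counts = [0] * n_subkeys
    counts[target] = total * x // n_subkeys
    return _spread(counts, total - counts[target], target)


def growing_counts(per_subkey, y, target=0, n_subkeys=16):
    """Second part: all target traces plus per_subkey * (y - 1) non-target traces."""
    if not 1 <= y <= n_subkeys:
        raise ValueError("y must lie in [1, n_subkeys]")
    counts = [0] * n_subkeys
    counts[target] = per_subkey
    return _spread(counts, per_subkey * (y - 1), target)


set_15_fixed = fixed_size_counts(5000, 15)   # 4687 target traces, 313 from the others
set_10_growing = growing_counts(5000, 10)    # 5000 target traces, 3000 from each other subkey
```

With these counts, the fixed-size set for x = 16 and the growing set for y = 1 both contain only the 5K target traces, which matches the statement later in the paper that these two models are trained with the same training set.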

Experimental results
In this section, we first introduce the DL model structure. Afterwards, the training setup is presented. Finally, we show the experimental results of the cross-subkey side-channel analysis method on three datasets. We use the ρ-test as a leak detection method 22 .
Training setup. We divide the experiment into two parts: the first demonstrates that including sub-traces of non-target subkeys has a positive influence on the training of the model, and the second applies the cross-subkey approach.
Part I. Data augmentation increases the amount of training data by adding minor alterations to the existing training traces. However, too many alterations in the training set may confuse the neural network, so finding the optimal amount of augmenting traces in the training set becomes a practical problem. Thus, for each database, we build 16 different training sets, which contain different amounts of augmenting traces, to train 16 deep-learning models. Figure 1 shows an example of how these training sets are built. We call these training sets $set_1$ to $set_{16}$. Suppose the database contains $x$ traces for training and we divide each trace into 16 segments as shown in Fig. 1, which are related to the 16 subkeys.
Part II. In image classification, data augmentation methods such as cropping, rotating, flipping, scaling and shifting are often used 23 . These methods essentially apply a series of changes to the original data in order to expand the training set on which the models are trained. In Part I, we do not change the size of the training set. The main work in this part is to use all the traces of the target subkey and to expand the training set with sub-traces of non-target subkeys for the purpose of data augmentation. Assuming that the database contains $x$ training traces, we again train 16 models, as in Part I. The training sets of the 16 models are denoted by $set_1$ to $set_{16}$. The amount of data in $set_y$ ($y \in [1,16]$) is $x \times y$, where the $x \times (y - 1)$ additional sub-traces are provided in equal shares by the other 15 subkeys.
Results on software AES-128 implementation on STM32F3 (AES_STM32). The first dataset is captured with a ChipWhisperer-Lite 24 device at a sampling frequency of 40 MHz. The target board is the CW308T-STM32F3, whose target chip is an Arm Cortex-M4 running the TinyAES-128 implementation. The mode of operation is Electronic Codebook (ECB).
For the first round of the AES algorithm, 11K power traces are captured for the experiments. Of these, 6K use random plaintexts and random keys: 5K form the training set and 1K the validation set. The remaining 5K, captured with a fixed key and random plaintexts, are used as the testing set. Each power trace has 750 sampling points and contains all SubBytes operations of the first round, as shown in Fig. 4. We call this home-made dataset AES_STM32 in brief.
Results. Experiment I. We first used the ρ-test to detect the POI of each subkey, as shown in Fig. 5a. The POI of each subkey in this dataset correspond to 40 sampling points on the trace. Specifically, the trace segment for the first SBox operation is [28 : 68] (the 1st subkey is the target subkey). Figure 5b shows how we synchronised the segments for the different subkey bytes. In this experiment we generated 16 training sets, called $set_1, set_2, \ldots, set_{16}$, following the training method of Part I. Each training set contains 5K traces, and 1K traces of the target subkey serve as the validation set; during training, the model is saved when it reaches its highest accuracy on the validation set. The testing set contains 5K traces of the target subkey; we also tested the other subkeys, each with 5K traces of the corresponding subkey. Models $M_1, M_2, \ldots, M_{16}$ are then trained on the corresponding training sets. The batch size is set to 256, the maximum number of epochs to 500 and the learning rate to 0.0005. Since the parameter updates during training are stochastic, we trained each model 10 times and took the mean value as the experimental result. Table 1 shows the accuracies of the 16 models on the testing sets of all subkeys, where $M_i$ ($i \in [1,16]$) denotes the models and $S_i$ ($i \in [1,16]$) denotes the testing sets of the different subkeys; e.g. the first column of the first row shows the accuracy of $M_1$ on the testing set of the first subkey (accuracies are percentages, with the % omitted). We found that model $M_{15}$ had the highest accuracy on the testing set of the first subkey. Because the training set of $M_{16}$ consists entirely of traces of the first subkey, $M_{16}$ corresponds to traditional one-to-one training; the best cross-subkey model is 6.52% more accurate than it.
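As a rough illustration of the POI detection step at the start of Experiment I, the sketch below implements a simplified correlation-based leakage detection in the spirit of the ρ-test. It is not the exact procedure of ref. 22: it uses a single profiling/evaluation split instead of cross-validation and applies no significance threshold. The function names and the synthetic traces are assumptions made for the example.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def correlation_profile(traces, labels):
    """Per sample point: estimate class means on the first half of the traces,
    then correlate the predicted means with the measurements of the second half."""
    half = len(traces) // 2
    profile = []
    for t in range(len(traces[0])):
        sums, counts = {}, {}
        for trace, label in zip(traces[:half], labels[:half]):
            sums[label] = sums.get(label, 0.0) + trace[t]
            counts[label] = counts.get(label, 0) + 1
        overall = sum(sums.values()) / half  # fallback for unseen labels
        predicted = [sums[l] / counts[l] if l in counts else overall
                     for l in labels[half:]]
        measured = [trace[t] for trace in traces[half:]]
        profile.append(pearson(predicted, measured))
    return profile


def select_window(profile, width):
    """Start index of the `width`-sample window with the largest summed |rho|."""
    best_score, best_start = -1.0, 0
    for start in range(len(profile) - width + 1):
        score = sum(abs(r) for r in profile[start:start + width])
        if score > best_score:
            best_score, best_start = score, start
    return best_start


# Synthetic data: 400 traces of 10 samples; only samples 4..6 depend on the label.
rng = random.Random(0)
labels = [rng.randrange(16) for _ in range(400)]
traces = [[label + rng.gauss(0, 0.5) if 4 <= t <= 6 else rng.gauss(0, 1)
           for t in range(10)] for label in labels]
profile = correlation_profile(traces, labels)
poi_start = select_window(profile, width=3)
```

On the synthetic data the selected window starts at sample 4, where the label-dependent samples were placed; on real traces the window width would correspond to the 40 or 90 sample points reported for each dataset.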
Next, we show the results when the number of traces in the training set is increased.
Experiment II. In this subsection the first subkey is again used as the target subkey. In contrast to Experiment I, the size of the training set increases from model to model: it grows by 5K traces at a time, and these 5K traces are equally distributed among the sub-traces of the non-target subkeys. The training set for $M_1$ consists of all the traces of the first subkey, and the training set for $M_{16}$ consists of all the traces of all subkeys. The models $M_i$ ($i \in [1,16]$) are then trained on the corresponding training sets. The other hyperparameters are the same as in Experiment I. Each model is trained 10 times, and the mean results on the testing sets of the different subkeys are taken as the experimental results. Table 2 shows the accuracy of the 16 models on the testing sets of all subkeys, where $M_i$ denotes the models and $S_i$ ($i \in [1,16]$) denotes the testing sets of the different subkeys (accuracies are percentages, with the % omitted). We found that model $M_{10}$ had the highest accuracy on the testing set of the first subkey: it is 28.20% more accurate than the traditional one-to-one trained model $M_1$ on that testing set.
Classification accuracy only partially reflects the effectiveness of the models for SCA. We therefore also evaluated the models $M_i$ of Experiment I and the models $M_i$ of Experiment II on the testing set $S_1$ using Key Rank and PGE. The results are shown in Table 3. Because AES_STM32 has a high signal-to-noise ratio, the DL models reach a high classification accuracy on $S_1$ and need few traces to recover the correct key; for this dataset we therefore mainly use Key Rank < 5 for the experimental comparison (the larger the number of traces with Key Rank < 5, the more efficient the DL models are at SCA).
With the fixed-size training sets of Experiment I, $M_{15}$ yields 650 more traces with Key Rank < 5 on the testing set than $M_{16}$. With the growing training sets of Experiment II, $M_{10}$ yields 2064 more such traces on the testing set $S_1$ than $M_1$ ($M_{16}$ of Experiment I and $M_1$ of Experiment II are trained with the same training set).
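The "number of traces with Key Rank < 5" compared above can be counted as in the following sketch. It assumes one reading of the metric, namely that each testing trace is ranked on its own and that ranks are counted from 0; neither detail is stated explicitly in the text. The function name is hypothetical.

```python
def count_low_rank(prob_rows, correct_keys, threshold=5):
    """Count traces for which the correct key ranks below `threshold`.

    The rank of a key is the number of candidates with a strictly higher
    probability, so the most probable candidate has rank 0 (an assumption).
    """
    count = 0
    for row, correct in zip(prob_rows, correct_keys):
        rank = sum(1 for p in row if p > row[correct])
        if rank < threshold:
            count += 1
    return count


# Toy example with 8 key candidates instead of 256.
rows = [
    [0.50, 0.20, 0.10, 0.10, 0.05, 0.05, 0.00, 0.00],  # correct key 0 has rank 0
    [0.30, 0.25, 0.20, 0.10, 0.05, 0.05, 0.03, 0.02],  # correct key 2 has rank 2
    [0.05, 0.05, 0.10, 0.10, 0.20, 0.25, 0.15, 0.10],  # correct key 0 has rank 6
]
n_low = count_low_rank(rows, [0, 2, 0])  # 2
```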
Next, we validate the method on two well-known publicly available datasets.
Results on software AES-128 implementation on ATXMEGA128D4 (AES_XMEGA). The second dataset is captured from an 8-bit Atmel microcontroller, the ATXMEGA128D4; all the power traces generated during encryption are extracted using ChipWhisperer to form this dataset. The cipher is TinyAES-128 in Electronic Codebook (ECB) mode. The training, validation and testing sets of this dataset are set up in the same way as for the first dataset. We call this dataset AES_XMEGA in brief. Each power trace has 1700 sampling points and contains all SubBytes operations of the first round, as shown in Fig. 6. Specific information on this dataset can be found in ref. 13 .
Results. The ρ-test is first used to locate the POI of the 16 subkeys on the traces, as shown in Fig. 7a. Each subkey leakage interval contains 90 sample points. Our target is the first byte of the SBox output, which corresponds to the leakage interval [858 : 948]. Figure 7b shows how we synchronised the segments for the different subkey bytes. The other experimental configurations are identical to those of the first dataset. During the training of the DL models, we used the RMSprop optimizer with a learning rate of 0.001, a mini-batch size of 256 and a maximum of 500 epochs. First, the DL models are trained on training sets with a constant number of traces (the models $M_i$ of Experiment I). Then, the DL models are trained on training sets with an increasing number of traces (the models $M_i$ of Experiment II). Table 4 shows the classification accuracies (percentages, with the % omitted), Key Rank and PGE of the DL models trained with a constant number of traces in the training sets ($M_i$)
Figure 9b shows how we synchronised the segments for the different subkey bytes. The other experimental configurations are identical to those of the first dataset. During the training of the DL models, we used the RMSprop optimizer with a learning rate of 0.0001, a mini-batch size of 256 and a maximum of 500 epochs. First, the DL models are trained on training sets with a constant number of traces (the models $M_i$ of Experiment I). Then, the DL models are trained on training sets with an increasing number of traces (the models $M_i$ of Experiment II).
Table 5 shows the classification accuracies (percentages, with the % omitted), Key Rank < 5 and PGE (a "-" indicates that the correct key could not be recovered) of the DL models trained with a constant number of traces in the training sets (the models $M_i$ of Experiment I) and of the DL models trained with an increasing number of traces in the training sets (the models $M_i$ of Experiment II) on the testing set $S_{16}$ of the 16th subkey. Since the target subkey is the 16th subkey, we only show the results on $S_{16}$; the training process for the other subkeys is the same as for the target subkey.
The results show that when the size of the training sets does not change, $M_{15}$ trained with cross-subkey training has a 0.28% higher classification accuracy than $M_{16}$
Discussion. We set up two sets of experiments and validated them on the home-made dataset AES_STM32 and on the public datasets AES_XMEGA and AES_GPU. In Experiment I, the number of traces in the training set is the same for every model; what changes is the proportion of sub-traces of the target subkey and of the non-target subkeys in the training set. Because the model structure and the hyperparameters are identical for the 16 models, the training set is the only independent variable in the experiments.
The experimental results show that varying the proportion of target and non-target subkeys in the training sets (i.e. training the DL models with cross-subkey training) while keeping the training set size unchanged improves the final results on all three datasets. Because of the random nature of the parameter updates during neural-network training, we repeated the training 10 times for each model and took the average accuracy, Key Rank and PGE of each DL model on the testing sets of the different subkeys as the experimental results. Experiment I is designed to validate the effectiveness of cross-subkey training. In Experiment II, model $M_1$ is trained using all the traces of the target subkey, and the models $M_i$ ($i \in [2,16]$) are trained using a training set that is expanded with sub-traces of non-target subkeys. On AES_STM32, $M_{10}$ improved the classification accuracy over $M_1$ on the testing set $S_1$ by 28.20%, with an increase of 2064 traces for Key Rank < 5 and a decrease of 3 traces for PGE. On AES_XMEGA, $M_{12}$ improved the classification accuracy over $M_1$ on the testing set $S_1$ by 46.15%, the number of Key Rank < 5 traces increased by 1016, and the number of PGE traces decreased by 1. On AES_GPU, $M_{13}$ improved the classification accuracy over $M_1$ on the testing set $S_{16}$ by 0.67%, the number of Key Rank < 5 traces increased by 199, and the number of PGE traces decreased by 534. The results of Experiment II show that expanding the training set with traces of non-target subkeys gives twofold better results than training with the traces of the target subkey alone.
Finally, if traces of non-target subkeys are added to the training set, the trained model is equally effective on the testing sets of those non-target subkeys. This result suggests that the traditional approach of one model recovering one subkey can be replaced by one model recovering all subkeys.
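The idea of one model recovering all subkeys can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: a single shared model is assumed to return, for every sub-trace, a 256-entry score vector over SBox outputs; each key candidate is scored by summing the log-probabilities of the label it implies (equivalent in ordering to the element-wise product); and the trained network is replaced by a mock that puts probability 0.9 on the true label. All function names are hypothetical.

```python
import math


def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo the AES polynomial."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


def build_sbox():
    """AES SBox from its definition: field inverse followed by the affine map."""
    inverse = [0] + [next(y for y in range(1, 256) if gf_mul(x, y) == 1)
                     for x in range(1, 256)]
    sbox = []
    for b in inverse:
        s = b
        for shift in range(1, 5):
            s ^= ((b << shift) | (b >> (8 - shift))) & 0xFF
        sbox.append(s ^ 0x63)
    return sbox


SBOX = build_sbox()


def recover_key(score_vectors, plaintexts, eps=1e-40):
    """score_vectors[t][i] is the 256-entry output of the shared model for the
    i-th sub-trace of trace t; plaintexts[t][i] is the matching plaintext byte."""
    key = []
    for i in range(len(plaintexts[0])):
        log_scores = [0.0] * 256
        for per_trace, plaintext in zip(score_vectors, plaintexts):
            probs = per_trace[i]
            for k in range(256):
                # Key candidate k implies the label SBox(p XOR k) for this byte.
                log_scores[k] += math.log(probs[SBOX[plaintext[i] ^ k]] + eps)
        key.append(max(range(256), key=log_scores.__getitem__))
    return key


# Mock data: a fixed 16-byte key, three plaintexts and an idealised model.
true_key = list(range(16))
plaintexts = [[(7 * t + 13 * i) % 256 for i in range(16)] for t in range(3)]


def mock_scores(plaintext):
    vectors = []
    for p, k in zip(plaintext, true_key):
        v = [0.1 / 255] * 256
        v[SBOX[p ^ k]] = 0.9  # the mock model favours the true label
        vectors.append(v)
    return vectors


score_vectors = [mock_scores(pt) for pt in plaintexts]
recovered = recover_key(score_vectors, plaintexts)
```

With the idealised mock model the full key is recovered from three traces; with a real model the number of traces needed depends on its accuracy on each subkey's testing set.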

Conclusion
In this paper, we propose a cross-subkey deep-learning side-channel analysis, which uses sub-traces of the non-target subkeys as a data augmentation to build models with a better fitting capability. Our results show that, when the number of captured traces is limited, the accuracy, Key Rank and PGE of the models on the testing set can be improved by adding traces of other subkeys to the training set of the target subkey. This paper validates the effectiveness of cross-subkey training on the home-made dataset AES_STM32 and on the publicly available datasets AES_XMEGA and AES_GPU, but there are still many open questions about the links between different subkeys. Exploring these connections offers many possible directions of research, which will ultimately bring more cohesion to the field and more confidence in the results obtained.